Introduction to the dataset that EDA will be carried on:
Last updated 03/11/2014
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
There are MANY variables in this data set and you are not expected to explore all of them. You should explore between 10-15 variables in your analysis. Expected time to analyze the dataset 15-30 hours
Since the dataset is to vast to explore in this excercise I will intimately examine only 10-15 variables and create a dataframe from the relevant selection.
This variable dictionary explains the variables in the subset of the data set that I will explore:
listing keys; Unique key for each listing, same value as the ‘key’ used in the listing object in the API.
BorrowerState: The two letter abbreviation of the state of the address of the
Term: The length of the loan expressed in months.
LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
LoanOriginalAmount: The origination amount of the loan.
ClosedDate: Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
BorrowerRate: The Borrower’s interest rate for this loan.
ListingCategory..numeric.: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
Occupation: The Occupation selected by the Borrower at the time they created the listing.
EmploymentStatus: The employment status of the borrower at the time they posted the listing.
IsBorrowerHomeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
IncomeRange: The income range of the borrower at the time the listing was created.
DebtToIncomeRatio_ The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
IncomeVerifiable: The borrower indicated they have the required documentation to support their income.
LoanOriginationDate- date the loan was originated
## ListingKey LoanOriginalAmount BorrowerState Term LoanStatus
## 1 1021339766868145413AB3B 9425 CO 36 Completed
## 2 10273602499503308B223C1 10000 CO 36 Current
## 3 0EE9337825851032864889A 3001 GA 36 Completed
## 4 0EF5356002482715299901A 10000 GA 36 Current
## 5 0F023589499656230C5E3E2 15000 MN 36 Current
## 6 0F05359734824199381F61D 15000 NM 60 Current
## ClosedDate BorrowerRate ListingCategory..numeric. Occupation
## 1 2009-08-14 00:00:00 0.1580 0 Other
## 2 <NA> 0.0920 2 Professional
## 3 2009-12-17 00:00:00 0.2750 0 Other
## 4 <NA> 0.0974 16 Skilled Labor
## 5 <NA> 0.2085 2 Executive
## 6 <NA> 0.1314 1 Professional
## EmploymentStatus IsBorrowerHomeowner IncomeRange DebtToIncomeRatio
## 1 Self-employed True $25,000-49,999 0.17
## 2 Employed False $50,000-74,999 0.18
## 3 Not available False Not displayed 0.06
## 4 Employed True $25,000-49,999 0.15
## 5 Employed True $100,000+ 0.26
## 6 Employed True $100,000+ 0.36
## StateCapital LoanOriginationDate
## 1 COLORADO 2007-09-12 00:00:00
## 2 COLORADO 2014-03-03 00:00:00
## 3 GEORGIA 2007-01-17 00:00:00
## 4 GEORGIA 2012-11-01 00:00:00
## 5 MINNESOTA 2013-09-20 00:00:00
## 6 NEW MEXICO 2013-12-24 00:00:00
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## 1 ListingKey 0 0.00 0 0.00 0 0
## 2 LoanOriginalAmount 0 0.00 0 0.00 0 0
## 3 BorrowerState 0 0.00 5515 4.84 0 0
## 4 Term 0 0.00 0 0.00 0 0
## 5 LoanStatus 0 0.00 0 0.00 0 0
## 6 ClosedDate 0 0.00 58848 51.65 0 0
## 7 BorrowerRate 8 0.01 0 0.00 0 0
## 8 ListingCategory..numeric. 16965 14.89 0 0.00 0 0
## 9 Occupation 0 0.00 3588 3.15 0 0
## 10 EmploymentStatus 0 0.00 2255 1.98 0 0
## 11 IsBorrowerHomeowner 0 0.00 0 0.00 0 0
## 12 IncomeRange 0 0.00 0 0.00 0 0
## 13 DebtToIncomeRatio 19 0.02 8554 7.51 0 0
## 14 StateCapital 0 0.00 5897 5.18 0 0
## 15 LoanOriginationDate 0 0.00 0 0.00 0 0
## type unique
## 1 factor 113066
## 2 integer 2468
## 3 character 51
## 4 integer 3
## 5 factor 12
## 6 factor 2802
## 7 numeric 2294
## 8 integer 21
## 9 factor 67
## 10 factor 8
## 11 factor 2
## 12 factor 8
## 13 numeric 1207
## 14 factor 50
## 15 factor 1873
From the variables I will explore, BorrowerState, Occupation, EmploymentStatus, DebtToIncomeRatio and State Capital have percentage of missing observations in magnitude of up to 7.5%. CLosedDate for the loans has the biggest loss of observations, 51.6%
## ListingKey LoanOriginalAmount BorrowerState
## 17A93590655669644DB4C06: 6 Min. : 1000 Length:113937
## 349D3587495831350F0F648: 4 1st Qu.: 4000 Class :character
## 47C1359638497431975670B: 4 Median : 6500 Mode :character
## 8474358854651984137201C: 4 Mean : 8337
## DE8535960513435199406CE: 4 3rd Qu.:12000
## 04C13599434217079754AEE: 3 Max. :35000
## (Other) :113912
## Term LoanStatus ClosedDate
## Min. :12.00 Current :56576 2014-03-04 00:00:00: 105
## 1st Qu.:36.00 Completed :38074 2014-02-19 00:00:00: 100
## Median :36.00 Chargedoff :11992 2014-02-11 00:00:00: 92
## Mean :40.83 Defaulted : 5018 2012-10-30 00:00:00: 81
## 3rd Qu.:36.00 Past Due (1-15 days) : 806 2013-02-26 00:00:00: 78
## Max. :60.00 Past Due (31-60 days): 363 (Other) :54633
## (Other) : 1108 NA's :58848
## BorrowerRate ListingCategory..numeric. Occupation
## Min. :0.0000 Min. : 0.000 Other :28617
## 1st Qu.:0.1340 1st Qu.: 1.000 Professional :13628
## Median :0.1840 Median : 1.000 Computer Programmer: 4478
## Mean :0.1928 Mean : 2.774 Executive : 4311
## 3rd Qu.:0.2500 3rd Qu.: 3.000 Teacher : 3759
## Max. :0.4975 Max. :20.000 (Other) :55556
## NA's : 3588
## EmploymentStatus IsBorrowerHomeowner IncomeRange
## Employed :67322 False:56459 $25,000-49,999:32192
## Full-time :26355 True :57478 $50,000-74,999:31050
## Self-employed: 6134 $100,000+ :17337
## Not available: 5347 $75,000-99,999:16916
## Other : 3806 Not displayed : 7741
## (Other) : 2718 $1-24,999 : 7274
## NA's : 2255 (Other) : 1427
## DebtToIncomeRatio StateCapital LoanOriginationDate
## Min. : 0.000 CALIFORNIA:14717 2014-01-22 00:00:00: 491
## 1st Qu.: 0.140 TEXAS : 6842 2013-11-13 00:00:00: 490
## Median : 0.220 NEW YORK : 6729 2014-02-19 00:00:00: 439
## Mean : 0.276 FLORIDA : 6720 2013-10-16 00:00:00: 434
## 3rd Qu.: 0.320 ILLINOIS : 5921 2014-01-28 00:00:00: 339
## Max. :10.010 (Other) :67111 2013-09-24 00:00:00: 316
## NA's :8554 NA's : 5897 (Other) :111428
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Distribution of Original loan amounts is right tailed and it has 5 distinct peaks that needs to be investigated. Mean loan amount is 8337$
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Largest count of loans is still taged as current. From the past due loans the largest cathegory is past due between 1-15 days.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
## count ratio_4_times
## 1 367 0.3482535
Mean debth to income ratio is 0.28 and maximum is 10.010. The count to loans that have ratio more than 4 times compared to all loans (minus NA s) is 0.35%.
Length of loans are distributed in 3 cathegories, 1 3 and 5 year loans where count is 3>5>1 years.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
Distribution of Borrowers interest loans is quite narrow with a mean at 0.19%. Distribution is a slightly skewed to the left. The values it self seem a bit suspiceous. I would expect to see interest rates at least one magnitude higher from 0 to 5 percent so it could be an issue with the original dataset. I real case scenario I would have a direct dialogue with a domain expert in the topic:)
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
The highest income range for the borrowers is in two cathegories which in total range between 25 000 and 74 999 us dollars.
## Employed Full-time Not available Not employed Other
## 67322 26355 5347 835 3806
## Part-time Retired Self-employed NA's
## 1088 795 6134 2255
## [1] "ListingKey" "LoanOriginalAmount"
## [3] "BorrowerState" "Term"
## [5] "LoanStatus" "ClosedDate"
## [7] "BorrowerRate" "ListingCategory..numeric."
## [9] "Occupation" "EmploymentStatus"
## [11] "IsBorrowerHomeowner" "IncomeRange"
## [13] "DebtToIncomeRatio" "StateCapital"
## [15] "LoanOriginationDate"
90% of the borrowers that take loans are either Employed, Full-time eployed or Self-Employed
Top 5 occupations of borrowers were as following: Professional, Computer programmer, Executive, Teacher and Administrative assistant. Bottom of the list is dominated by students.
## False True
## 56459 57478
In the entire dataset number of homeowners vs the ones that do not own a home is very similar (False- 56459 vs True - 57478).
Distribution of years when the loan were originated. Note the peak in 2013.
Most common listing cathegory for a loan are Debt consolidation, Home improvement and Business, if we exclude Not available and other.
Subset of the data consists of 15 variables and is spread in 113937 rows. Data completeness report gave us an insight that most of the observations have values apart from the ClosedDate which is missing in 50% of the cases, and that is justifiable since it applies only to Cancelled, Completed, Chargedoff and Defaulted loan statuses.
I didn t manage to understand the root cause for the 871 of duplicated ListingKeys.
Main feature of interest in the dataset is the amount of loan.
Supporting features that I brought in externally are related to mapping out the US state codes into a human language. I did the same with the listing cathegories but remapping was done from an additional explanatory excel spreadsheet. Spatial context - geometry of the states would be of hel to see spatial distribution of loans and loan related variables.
Form of the data was not changed, i.e. data is in tidy format.
There is a positive relationship when comparing median as a metric between Loan amounts and 1, 3 and 5 years as lenght of loan. Higher to Term, higher the median loan amount.
Top 3 highest loans on average are targeting Debt consolidation, baby and adoption,and business, on the other side of the spectrum top 3 lowest loans on average are targeting student use, personal loans and buying an automobile.
Top 5 on average highest loans are initiated by folowing occupation groups: Judges, Pharmacists and Doctors (dataset filtered on more than 10 000USD loans).
What I expected to see when plotting Original Loan amount against depth to to income ratio is more negative correlation. What is clear in this plot iz that if we exclude >10 debt to income ratio, is that in a range of 1 to 10 of debt to income ratio there is very few data point in the space between 10 000 and 30 000 when comparing it to below 10 000 where the majority of the points lay i.e. burrowers with higher DTI ratio take lower loans.
Highest count of initiated loans in the datacomes from Professionals in the income groups: 50,000-74,999USD and 100,000+ USD (if we exclude combinations defined as Other).
It is evident that median BorrowerRate has been increasing over the years with a peak in 2011 and dropping again towards 2014. In addition there is a trend in narrowing IQR from 2010 onwards.
Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost not been outliers in a range over 30 000USD until year 2013 and 2014.
Main obesrvations:
positive relationship between increase of loan amounts medians and increase of length of loans - not a surprise
Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost no outliers in a range over 30 000USD until year 2013 and 2014.
It is evident that median BorrowerRate has been increasing over the years with a peak in 2011 and dropping again towards 2014. In addition there is a trend in narrowing IQR from 2010 onwards.
Top 5 on average highest loans are initiated by folowing occupation groups: Judges, Pharmacists and Doctors (dataset filtered on more than 10 000USD loans).
Top 3 highest loans on average are targeting Debt consolidation, baby and adoption and business, on the other side of the spectrum top 3 lowest loans on average are targeting student use, personal loans and buying an automobile.
Highest count of initiated loans in the datacomes from Professionals in the income groups: 50,000-74,999USD and 100,000+ USD (if we exclude combinations defined as Other)
THere is no in particular strong relationship found appart from the obvious ones which relate to OriginalLoanAmounts and levels of income, higher paid professions, increase in term length. I would benefit of having more in depth domain knowledge in order to scrutinize some of the more intricate relationships.
There is a lot of information in this graph which summarizes mean loans grouped per year faceted per Occupation and colored by a flag if borrower is a homeowner or not. What can be seen at a first glance is that means are either equal or bigger if a borrower is a homeowner. Moreover, means are increasing in general regardless of occupation over the period of last 5 years. Horisonal line on the graph is highlighting mean of 5000USD loans and by focusing on points below that line we can see that they belong to a Student groups which doesn’t seem to change over time significantly and colors are associated to a flag - is not a homeowner.
Along similar lines, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use is absent in last 4-6 years.
There is a lot of information in two graphs in multivariate part which summarizes mean loans grouped per year faceted per Occupation and colored by a flag if borrower is a homeowner or not. What can be seen at a first glance is that means are either equal or bigger if a borrower is a homeowner. Moreover, means are increasing in general regardless of occupation over the period of last 5 years. Horisonal line on the graph is highlighting mean of 5000USD loans and by focusing on points below that line we can see that they belong to a Student groups which doesn’t seem to change over time significantly and colors are associated to a flag - is not a homeowner.
Along similar lines, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use is absent in last 4-6 years.
Main feature of my interest was the maginute of original loan amounts Histogram shows count of the magnitude or the original loan amounts that borrowers have taken. What is evident from this plot is the skewness of the distribution with the right tail. In addition, There are several local peaks breaking the ideally right skewed distribution.
Plot two is focusing on showing a relationship of distributons of initial loan amounts vs year the loan is initiated. Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost absence of the outliers in a range over 30 000USD until year 2013 and 2014. All of the distributions are skewed.
Plot three carries a lot of information which enriches the plots from first and second plot with additional variables to see multivariate relationships with the main feature of interest - average original loan amounts per year, faceted based on the listing cathegory and colored by a variable which says if the borrower is already a homeowner or not.
What can be deduced is the following, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use are completely absent in last 4-6 years.
Exploration of dataset was challenging because firstly due to the time constraints I had to pick and choose 10-15 variables from 81 of them which was not easy not having extensive domain knowledge on the topic.
Univariate analyses gave me room to explore distributions of most important variables in the dataset.
Bivariate analyses pointed me in directions where should I focus when exploring two variables at the same time and increased my curiousity to explore timeseries change.
Multivariate analyses in fully made me able to explore the relationships between time, magnitude of loans and two additional variables like home ownership and listing type in the same visualisation - very powerful tools.
Some of the strugles I encountered were lack of domain knowledge in the topic and wrangling the datasets, preparing the data so it is in right format for plotting to happen in the right manner.
What went pretty well is generally how I handeled the who EDA process:)
As a future insight I would like to explore the dataset more in spatial domain and also start including more relevant variables from the full dataset.
In addition I would like to spend more time into quantifying potential relationships that were detected during the visual inspections of plots.
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
https://www.growingfamilybenefits.com/credit-scores-interest-rates-relationship/
http://t-redactyl.io/blog/2016/04/creating-plots-in-r-using-ggplot2-part-10-boxplots.html
https://admccarthy.github.io/Red_Wines/
https://docs.google.com/document/d/1-f3wM3mJSkoWxDmPjsyRnWvNgM57YUPloucOIl07l4c/pub